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Abstract 

We introduce a new approach to variable selection, called Predictive Correlation Screening, for 
predictor design. Predictive Correlation Screening (PCS) implements false positive control on the selected 
variables, is well suited to small sample sizes, and is scalable to high dimensions. We establish asymptotic 
bounds for Familywise Error Rate (FWER), and resultant mean square error of a Unear predictor on the 
selected variables. We apply Predictive Correlation Screening to the following two-stage predictor design 
problem. An experimenter wants to learn a multivariate predictor of gene expressions based on successive 
biological samples assayed on mRNA arrays. She assays the whole genome on a few samples and from 
these assays she selects a small number of variables using Predictive Correlation Screening. To reduce 
assay cost, she subsequently assays only the selected variables on the remaining samples, to leam the 
predictor coefficients. We show superiority of Predictive Correlation Screening relative to LASSO and 
correlation learning (sometimes popularly referred to in the hterature as marginal regression or simple 
thresholding) in terms of performance and computational complexity. 

I. Introduction 

Consider the problem of under-determined multivariate linear regression in which training data 
f^i is given and a Unear estimate of the g-dimensional response vector Yj, 1 < z < 

n < p, is desired: 

Yj = aiXji + . . . + BpXip + Cj, l<i<n, (1) 
This research was supported in part by AFOSR grant FA9550-13-1-0043. 
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where Xij is the ith sample of regressor varialbe (covariate) Xj, Yj is a vector of response variables, 
and Sij is the g-dimensional vector of regression coefficients corresponding to Xj, I < i < n,l < j < p. 
There are many applications in which the number p of regressor variables is larger than the number n of 
samples. Such applications arise in text processing of internet documents, gene expression array analysis, 
combinatorial chemistry, and others (Guyon & Elisseeff, 2003). In this p 3> n situation training a linear 
predictor becomes difficult due to rank deficient normal equations, overfitting errors, and high computation 
complexity. Many penalized regression methods have been proposed to deal with this situation, including: 
LASSO; elastic net; and group LASSO (Guyon & EUsseeff, 2003; Tibshirani, 1996; Efron et al., 2004; 
Buehlmann, 2006; Yuan & Lin, 2005; Friedman et al., 2001; Btihlmann & Van De Geer, 2011). These 
methods perform variable selection by minimizing a penalized mean squared error prediction criterion 
over all the training data. The main drawback of these methods is their high computation requirements for 
large p. In this paper we propose a highly scalable approach to under-determined multivariate regression 
called Predictive Correlation Screening (PCS). 

Like recently introduced correlation screening methods (Hero & Rajaratnam, 2011, 2012) PCS screens 
for connected variables in a correlation graph. However, unhke these correlation screening methods, 
PCS screens for connectivity in a bipartite graph between the regressor variables {Xi, . .., Xp} and the 
response variables {Yi, . . . , Yq}. An edge exists in the bipartite graph between regressor variable j and 
response variable k if the thresholded min-norm regression coefficient matrix A = [ai , . . . , a^] has a 
non-zero kj entry. When the j-th column of this thresholded matrix is identically zero the j-th regressor 
variable is thrown out. 

PCS differs from correlation learning, also called marginal regression, simple thresholding, and sure 
independence screening (Genovese et al., 2012; Fan & Lv, 2008), wherein the simple sample cross- 
correlation matrix between the response variables and the regressor variables is thresholded. Correlation 
learning does not account for the correlation between regressor variables, which enters into PCS through 
the pseudo-inverse correlation matrix - a quantity that introduces little additional computational complexity 
for small n. 

To illustrate our method of PCS we apply it to a two-stage sequential design problem that is relevant to 
applications where the cost of samples increases with p. This is true, for example, with gene microarray 
experiments: a high throughput "full genome" gene chip with p = 40, 000 gene probes can be significantly 
more costly than a smaller assay that tests fewer than p = 15, 000 gene probes (see Fig. 1). In this situation 
a sensible cost-effective approach would be to use a two-stage procedure: first select a smaller number 
of variables on a few expensive high throughput samples and then construct the predictor on additional 
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cheaper low throughput samples. The cheaper samples assay only those variables selected in the first 
stage. 

Specifically, we apply PCS to select variables in the first stage of the two-stage procedure. While 
bearing some similarities, our two-stage PCS approach differs from the many multi-stage adaptive 
support recovery methods that have been collectively called distilled sensing (Haupt et al., 2011) in 
the compressive sensing literature. Like two-stage PCS, distilled sensing (DS) performs initial stage 
thresholding in order to reduce the number of measured variables in the second stage. However, in 
distilled sensing the objective is to recover a few variables with high mean amplitudes from a larger 
set of initially measured regressor variables. In contrast, two-stage PCS seeks to recover a few variables 
that are strongly predictive of a response variable from a large number of initially measured regressor 
variables and response variables. Furthermore, unhke in DS, in two-stage PCS the final predictor uses 
all the information on selected variables collected during both stages. 

We establish the following theoretical results on PCS and on the two-stage application of PCS. First, 
we estabhsh Poisson-like limit theorem for the number of variables that pass the PCS screen. This 
gives a Poisson approximation to the probability of false discoveries that is accurate for small n and 
large p. The Poisson-like limit theorem also specifies a phase transition threshold for the false discovery 
probability. Second, with n, the number of samples in the first stage, and t, the total number of samples, 
we estabhsh that n needs only be of order log(p) for two-stage PCS to succeed with high probability in 
recovering the support set of the optimal OLS predictor. Third, given a cost-per-sample that is linear in 
the number of assayed variables, we show that the optimal value of n is on the order of log(t). These 
three results are analogous to theory for correlation screening (Hero & Rajaratnam, 2011, 2012), support 
recovery for multivariate lasso (Obozinski et al., 2008), and optimal exploration vs exploitation allocation 
in multi-armed bandits (Audibert et al., 2007). 

The paper is organized as follows. Section II defines the under-determined multivariate regression 
problem. Section m gives the Poisson-like asymptotic theorem for the thresholded regression coefficient 
matrix. Section IV defines the PCS procedure and associated p-values. Section V defines the two-stage 
PCS and prediction algorithm. Section VI gives theorems on support recovery and optimal sample 
allocation to the first stage of the two-stage algorithm. Section Vn presents simulation results and an 
application to symptom prediction from gene expression data. 



April 11, 2013 



DRAFT 



4 




Fig. 1. Pricing per slide for Agilent Custom Micorarrays G2309F, G2513F, G4503A, G4502A (Feb 2013). The cost increases 
as a function of probeset size. Source: BMC Genomics and RNA Profiling Core. 



II. Under-determined multivariate regression problem 

Assume X = [Xi , . . . , Xp] and Y = [li , . . . , Fg] are random vectors of regressor and response 
variables, from which n observations are available. We represent the n x p and n x q data matrices as 
X and Y, respectively. We assume that the vector X has an elliptically contoured density with mean 
fix and non-singular p x p covariance matrix "Sx, i.e. the probability density function is of the form 
/x(x) = g ((x — fix)^'^x~^{'^ — fJ'x)), in which 5 is a non-negative integrable function. Similarly, the 
vector Y, is assumed to follow an elliptically contoured density with mean fiy and non-singular q x q 
covariance matrix Yly. We assume that the joint density function of X and Y is bounded and differentiable. 
Denote the p x q population cross covariance matrix between X and Y by "Exy 

The p X p sample covariance matrix S for data X is defined as: 

1 — — 

i=l 

where X(j) is the ith row of data matrix X, and X is the vector average of aU n rows of X. 

Consider the nx (p + q) concatenated matrix Z = [X, Y]. The sample cross covariance matrix S^^ is 
defined as the lower left q x p block of the {p + q) x {p + q) sample covariance matrix obtained by (2) 
using Z as the data matrix instead of X. 

Assume that p ^ n. We define the ordinary least squares (OLS) estimator of Y given X as the 
min-norm solution of the underdetermined least squares regression problem 

min||Y'^ - BX^Ill^, (3) 
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where ||A||f represents the Frobenius norm of matrix A. The min-norm solution to (3) is the q x p 
matrix of regression coefficients 

B = s2'^(s^)^ (4) 

where denotes the Moore-Penrose pseudo-inverse of matrix A. If the iHi column of B is zero then 
the ith variable is not included in the OLS estimator. This is the main motivation for the proposed partial 
correlation screening procedure. 

The PCS procedure for variable selection is based on the U-score representation of the correlation 
matrices. It is easily shown that there exist matrices and U^' of dimensions (n — 1) x p and {n — l)xq 
respectively, such that the columns of and lie on the (n — 2)-dimensional unit sphere 

Sn—2 in 

M"~^ and the following representations hold (Hero & Rajaratnam, 2012): 

sy- = -Dl{{uyfi]^)-Dl, (5) 

and: 

(S^)t = Dgi ((U^)^(U^(U^)^)-2u^)Dsi , (6) 

where Dm denotes the diagonal matrix obtained by zeroing out the off-diagonals of matrix M. Note 
that and are constructed from data matrices X and Y, respectively. 

Throughout this paper, we assume the data matrices X and Y have been normahzed in such a way 
that the sample variance of each variable Xi and Yj is equal to 1 for 1 < z < p and 1 < j < q. This 
simplifies the representations (5) and (6) to 8^=^ = {WfU"^ and (S=^)t = (U^)^(U^(U^)^)-2u^. Using 
these representations, one can write: 

Y = S^'^CS^jtx = (U2')'^(U^(U^)^)-^U^X. (7) 

Defining = {W{Wf)-^WU~J^^^^^^^^^^,^_,^^, we have: 

Y = (U^)^U^D^\j,^^(^,^^,^^^_2^,X (8) 

= (H^^)'^D^^jj^^r^jj^^jj,^T)-2u»X5 (9) 

where 

H^y = (fj^fijy^ (10) 

Note that the columns of matrix lie on Sn-2- This can simply be verified by the fact that diagonal 
entries of the px p matrix (U^)-^U^ are equal to one. 
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The U-score representations of covariance matrices completely specify the regression coefficient matrix 
S2'^(S^)t. 

We define variable selection by discovering columns of the matrix (11) that are not close to zero. The 
expected number of discoveries will play an important role in the theory of false discoveries, discussed 
below. 

From Sec. II we obtain a U-score representation of the regression coefficient matrix: 

o {a ) — yn ) -l»(u-)t'(u-(U-)^)-2U- • 

Under the condition that D(Ui=)r(u»(u»)T)-2U» has non-zero diagonal entries, the zth column of S^^(S^)^ 
is a zero vector if and only if the zth row of H^^ is a zero vector, for 1 < z < p. This motivates screening 
for non-zero rows of the matrix H^^ instead of columns of S^^(S^)^. 

Fix an integer 5 G {1, 2, • • • ,p} and a real number p G [0, 1]. For each 1 < i < p, we call i a 
discovery at degree threshold 5 and correlation threshold p if there are at least 5 entries in zth row of 
IrL^y of magnitude at least p. Note that this definition can be generalized to an arbitrary matrix of the 
form (U^)^U^ where and are matrices whose columns lie on S'n-2- For a general matrix of the 
form (1U^)^U2' we represent the number of discoveries at degree level 6 and threshold level p as iV^^. 

III. Asymptotic theory 

The following notations are necessary for the propositions in this section. We denote the surface area 
of the (n — 2)-dimensional unit sphere Sn-2 in M"~^ by an. Assume that U, V are two independent and 
uniformly distributed random vectors on S'n-2- For a threshold p G [0, 1], let r = — p). Pq is then 

defined as the probability that either ||U — V||2 < r or ||U-|-V||2 < r. Pq can be computed using the 
formula for the area of spherical caps on -S'n_2 (Hero & Rajaratnam, 2012). 

Define the index set C as: 

C = {(io,n, ■■■,is) ■ 

l<io<P,'i-<k< ■■■<is<q}- (12) 

For arbitrary joint density /uo,. -,U5(uo, • • • , U5) defined on the Cartesian product = 'S'n-2 x • • • x 
S'n-2, define /uj,u^i,...,U5!, (uo, ui, . . . , u^) as the average of 

/u,(soUo, siui, . . . , ssus) = 
fuf (sqUo, SlUi, . . . , ssus), (13) 
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for all i = {io,ii, • • . ,^^) G C and sj G {—1, 1},0 < j < S. 

In the following propositions, k represents an upper bound on the number of non-zero entries in any 
row or column of covariance matrix "Ex or cross covariance matrix 'Exy. We define ||^pgnfe<5lli ~ 



1^1 ^ Z^ieC g n <5(^)' average dependency coefficient, as the average of 



(14) 



in which ^fc(io) is defined as the set complement of the union of indices of non-zero elements of the 
io-th column of Jlyx'^x^- Finally, the function J of the joint density /uo,...,U5(uo, • • • ,us) is defined as: 



^(/uo,...,uJ = |'S'„_2|'^ / /uo,...,u,(u,...,u)du. 



(15) 



The following proposition gives an asymptotic expression for the number of discoveries in a matrix of 
the form (U^)^ILJ^, as p — > oo, for fixed n. Also it states that, under certain assumptions, the probabiUty 
of having at least one discovery converges to a given limit. This limit is equal to the probability that a 
certain Poisson random variable Ng^^ with rate equal to limp_^oo ^Ws^p ] takes a non-zero value, i.e. it 
satisfies: Nt > 0. 

Proposition 1: Let = [Uf , Uf , U^] and U^^ = [U^', U|, U^] be (n - 1) x p and (n - 1) x g 
random matrices respectively, with Uf,Uj G Sn-2 for 1 < i < p,l < j < q. Fix integers S > 1 
and n > 2. Assume that the joint density of any subset of {U'J', ...U^, U^, U^} is bounded and 
differentiable. Let {pp}p be a sequence in [0, 1] such that pp ^ 1 as p ^ oo andpJg(l — p^)^^ e„,5. 
Then, 

lim E[N^y ] = Urn ep,5,nAPp-^(/u5,U^,,...,U^J 



p— >-oo 



where ip,q,n,5,Pp = PiD^o and Kn,s = (en,5a„/(n - 2)f /S\. 

Assume also that k = o{{p^q)^^^^'^^'') and that the average dependency coefficient satisfies 



limp^^llA^f^ ^^^^^^lli = 0. Then: 



with 



f«l>0)^l-exp(-Af), (17) 



^T=^^T^mZJ. (18) 



Proof of Proposition 1: See appendix. 
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The following proposition states that when the rows of data matrices X and Y are i.i.d. elliptically 
distributed with block sparse covariance matrices, the rate (16) in Proposition 1 becomes independent of 

Sa; and T^xy. Specifically, the {5 + l)-fold average J(/uj,u?j,...,u?^) converges to 1 while the average 
dependency coefficient 11^^^^^, ^||i goes to 0, as p — )■ oo. This proposition will play an an important 
role in identifying phase transitions and in approximating p- values. 

Proposition 2: Assume the hypotheses of Prop. 1 are satisfied. In addition assume that the rows of 
data matrices X and Y are i.i.d. elhptically distributed with block sparse covariance and cross covariance 
matrices "Ex and "^xy Then A^^ in the limit (18) in Prop. 1 is equal to the constant 5 given in (16). 
Moreover, IJx '^Ux- 
Proof of Proposition 2: See appendix. 

IV. Predictive Correlation Screening 
Under the assumptions of Propositions 1 and 2: 

Pi^sX > 0) ^ 1 - eM-^P,q,n,5,P,) as p ^ 00 (19) 

Using the above limit, approximate p-values can be computed. Fix a degree threshold 6 < q and a 
correlation threshold p* G [0,1]. Define ^p. (H^^') as the undirected bipartite graph (Fig. 2) with parts 
labeled x and y, vertices {Xi,X2, ...,Xp} in part x and {¥±,¥2, ...,Yq} in part y. For I < i < p and 
1 < J < 9> vertices Xi and ¥j are connected if \h^j\ > p*, where h^J is the (i, j)th entry of H^^/ defined 
in (10). Denote by df the degree of vertex Xi in ^p. (H^^). For each value S e {!,■ ■ ■ ,maxi<i<pdf}, 
and each i, 1 < i < p, denote by pi{S) the maximum value of the correlation threshold p for which 
df > 5 in QpCH^y). pi{d) is in fact equal to the 6th largest value \h^j\, 1 < j < q. pi{S) can be computed 
using Approximate Nearest Neighbors (ANN) type algorithms (Jegou et al., 2011; Arya et al., 1998). 
Now for each i define the modified threshold pf°'^{5) as: 

pr\6) = WiPi{d), l<i<p, (20) 

where Wi = D{i)/ J^j^i in which D{i) is the ith diagonal element of the diagonal matrix 

■^(u-)r(u-(u-)^)--'u- (recall Sec. II). 

Using Propositions 1 and 2 the p-value associated with variable X^ at degree level 6 can be approxi- 
mated as: 

pvs{i) ~ 1 - exp(-^p^q ,^^^^^,.„d(5)). (21) 
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Fig. 2. Predictive correlation screening threstiolds the matrix H^" in (11) to find variables Xi that are most predictive of 
responses Yj. This is equivalent to finding sparsity in a bipartite graph CJp»(H'^^) with parts x and y which have p and q 
vertices, respectively. For 1 < i < p and 1 < j < q, vertex Xi in part x is connected to vertex Y, in part y if | h^^ | > p* ■ 



The set of p- values (21), i = 1,. . . ,p, provides a measure of importance of each variable Xi in predicting 
Yj's. Under a block-sparsity null hypothesis, the most important variables would be the ones that have the 
smallest p-values. Similar to the result in (Hero & Rajaratnam, 2011, 2012), there is a phase transition 
in the p-values as a function of threshold p. More exactly, there is a critical threshold pc,s such that if 
P > Pc,5' the average number E[Ng^^] of discoveries abruptly decreases to and if p < pc,s the average 
number of discoveries abruptly increases to p. The value of this critical threshold is: 

Pes = ^1 - «>)-''/('^"-')-'^ (22) 

where = a„5J(/uj,u^j,...,u'4 )■ When 6 = 1, the expression given in (22) is identical, except for the 
constant c^^^, to the expression (3.14) in (Hero & Rajaratnam, 2011). 

Expression (22) is useful in choosing the PCS correlation threshold p*. Selecting p* slightly greater 
than pc^s will prevent the bipartite graph C/p. (H^^) from having an overwhelming number of edges. 

Normally 6 = 1 would be selected to find all regressor variables predictive of at least 1 response 
variable Yj. A value of 6 = d > 1 would be used if the experimenter were only interested in variables 
that were predictive of at least d of the responses. Pseudo-code for the complete algorithm for variable 
selection is shown in Fig. 3. The worse case computational complexity of the PCS algorithm is only 
0{np\ogq). 



V. Two- STAGE PREDICTOR DESIGN 

Assume there are a total of t samples {Yj, Xj}*^]^ available. During the first stage a number n < t of 
these samples are assayed for all p variables and during the second stage the rest of the i — n samples 
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• Initialization: 

1) Choose an initial threshold p* > /9c,5 

2) Calculate the degree of each vertex on 
side X of the bipartite graph Qp* (H.^^) 

3) Select a value of 6 G 
{1, • • • ,maxi<i<pdf} 

• For each i = 1, - ■ ■ ,p find pi{6) as the Sth 
greatest element of |, 1 < j < q} 

• Compute using (20) 

• Approximate the p-value corresponding to 
the ith independent variable Xi as pvs {i) ~ 

• Screen variables by thresholding the p- 
values pvs{i) at desired significance level 

Fig. 3. Predicive Correlation Screening (PCS) Algorithm 

are assayed for a subset of k < p of the variables. Subsequently, a /c-variable predictor is designed using 
all t samples collected during both stages. The first stage of the PCS predictor is implemented by using 
the PCS algorithm with 6 = 1. 

As this two-stage PCS algorithm uses n and t samples in stage 1 and stage 2 respectively, we denote the 
algorithm above as the n\t algorithm. Experimental results in Sec. Vn show that for n <^p,if LASSO or 
correlation learning is used instead of PCS in stage 1 of the two-stage predictor the performance suffers. 
An asymptotic analysis (as the total number of samples t — > oo) of the above two-stage predictor can 
be performed to obtain optimal sample allocation rules for stage 1 and stage 2. The asymptotic analysis 
discussed in Sec. VI provides minimum Mean Squared Error (MSE) under the assumption that n, t, p, 
and k satisfy the budget constraint: 

np+ {t — n)k < jjL, (23) 

where jjl is the total budget available. The motivation for this condition is to bound the total sampling 
cost of the experiment. 
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VI. Optimal stage- wise sample allocation 

We first give theoretical upper bounds on the Family-Wise Error Rate (FWER) of performing variable 
selection using p-values obtained via PCS. Then, using the obtained bound, we compute the asymptotic 
optimal sample size n used in the first stage of the two-stage predictor, introduced in the previous section, 
to minimize the asymptotic expected MSE. 

We assume that the response Y satisfies the following ground truth model: 

Y = Sii^Xi^ + ai^Xi^ + ■ ■ ■ + a,, + N, (24) 

where ttq = {ii, • • • , ife} is a set of distinct indices in {!,... ,p}, X = [Xi,X2, • • • , Xp] is the vector of 
predictors, Y is the g-dimensional response vector, and N is a noise vector statistically independent of 
X. Xi^ , • ■ ■ 5 ^6 called active variables and the remaining p— k variables are called inactive variables. 
We assume that the p-dimensional vector X follows a multivariate normal distribution with mean and 
px p covariance matrix S = [o'ij]i<i,j<p, where E has the following block diagonal structure: 

aij = aji = 0, V z G TTo, j G {1, • • • ,p}\7i"o- (25) 

In other words active (respectively inactive) variables are only correlated with the other active (respectively 
inactive) variables. Also, we assume that N follows a multivariate normal distribution with mean and 
covariance matrix alqxq- 

We use the PCS algorithm of Sec. IV with S = 1 to select the k variables with the smallest p-values. 
These selected variables will then be used as estimated active variables in the second stage. The following 
proposition gives an upper bound on the probability of selection error for the PCS algorithm. 

Proposition 3: If n > 0(logp) then with probabihty at least 1 — q/p, PCS recovers the exact support 

TTq. 

Proof of Proposition 3: See appendix. □ 
Proposition 3 can be compared to Thm. 1 in (Obozinski et al., 2008) for recovering the support ttq by 
minimizing a LASSO-type objective function. The constant in 0(logp) of Prop. 3 is increasing in the 
dynamic range coefficient 

1=1, ••• ,q minjgTro 

where B = [bi, • • • , bp] = S^/^A. The worst case (largest constant in 6(logp)) occurs when there is 
high dynamic range in some rows of the q x p matrix B. 

The following proposition states the optimal sample allocation rule for the two-stage predictor, as 
i — >■ oo. 
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Proposition 4: The optimal sample allocation rule for the two-stage predictor introduced in Sec. V 
under the cost condition (23) is 

( O(logt), c{p-k)logt + kt<ii 
n = < (27) 

I 0, o.w. 
Proof of Proposition 4: See appendix. □ 
Proposition 4 implies that for a generous budget (/x large) the optimal first stage samphng allocation 
is log(t). However, when the budget is tight it is better to skip stage 1 (n = 0). Figure 4 illustrates the 
allocation region as a function of the sparsity coefficient p= 1 — k/p. 

VII. Simulation RESULTS 

a) Efficiency of Predictive Correlation Screening.: We illustrate the performance of the two- stage 
PCS algorithm and compare to LASSO and correlation learning methods (Tibshirani, 1996; Genovese 
et al, 2012). 

In the first set of simulations we generated an n x p data matrix X with independent columns, each of 
which is drawn from a p-dimensional multivariate normal distribution with identity covariance matrix. 
The qx p coefficient matrix A is then generated in such a way that each column of A is active with 
probability 0.1. Each active column of A is a random g-dimensional vector with i.i.d. M{Q, 1) entries, 
and each inactive column of A is a zero vector. Finally, a synthetic response matrix Y is generated by 
a simple hnear model 

= AX^ + N^, (28) 

where N is ra x g noise matrix whose entries are i.i.d. Ar(0, 0.05). The importance of a variable is 
measured by the value of the £2 norm of the corresponding column of A. Note that the linear model in 
(28) trivially satisfies the block sparsity assumptions on the covariance matrices in Prop. 2. 

We implemented LASSO using an active set type algorithm - claimed to be one the fastest methods for 
solving LASSO (Kim & Park, 2010). We set the number of regressor and response variables io p = 200 
and q = 20, respectively, while the number of samples n was varied from 4 to 50. Figure 5 shows the 
average number of mis-selected variables for both methods, as a function of n. The plot is computed by 
averaging the results of 400 independent experiments for each value of n. Figure 6 shows the average 
run time on a logarithmic scale, as a function of n (MATLAB version 7.14 running on 2.80GHz CPU). 
As we see, for low number of samples, PCS has better performance than LASSO and is significantly 
faster. 
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Fig. 4. Left: surface ^/p = p\ogt+(l — p)t. Right: contours indicating optimal allocation regions for /i/p = 30 and p/p — 60. 
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Fig. 5. Average number of mis-selected variables for active set implementation of LASSO (dashed) vs. Predictive Correlation 
Screening (solid), p = 200, q — 20. 
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Fig. 6. Average CPU time for active set implementation of LASSO (dashed) vs. PCS (solid), p = 200, q = 20. 



April 11, 2013 



DRAFT 



14 




150 200 250 300 350 400 450 500 



n 

Fig. 7. Average number of mis-selected variables. Active set implementation of LASSO (red-dashed) vs. correlation learning 
(green-dashed) vs. PCS (solid), p = 10*, q ~ I- 

To illustrate PCS for a higher dimensional example, we set p = 10^, g = 1 and compared PCS with 
LASSO and also the correlation learning method of (Genovese et al., 2012), for a small number of 
samples. Figure 7 shows the results of this simulation over an average of 400 independent experiments 
for each value of n. In this experiment, exactly 100 entries of A are active. The active entries are i.i.d. 
draws of M{0, 1) and inactive entries are equal to zero. Unlike Fig. 5, here the regressors variables are 
correlated. Specifically, Xi, - ■ ■ ,Xp, are i.i.d. draws from a multivariate normal distribution with mean 
and block diagonal covariance matrix satisfying (25). As we see for small number of samples, PCS 
performs significantly better in selecting the important regressor variables. 

b) Efficiency of The Two-stage Predictor.: To test the efficiency of the proposed two-stage predictor, 
a total of t samples are generated using the linear model (28) from which n = 25 log t are used for the 
task of variable selection at the first stage. All t samples are then used to compute the OLS estimator 
restricted to the selected variables. We chose t such that n = (130 : 10 : 200). The performance is 
evaluated by the empirical MSE:= YlT^iO^i ~ ^j)^/"^; where m is the number of simulation trials. 
Similar to the previous experiment, exactly 100 entries of A are active and the regressor variables follow 
a multivariate normal distribution with mean and block diagonal covariance matrix of the form (25). 
Figure 8 shows the result of this simulation for p = 10^ and q = I. Each point on the plot is an average 
of 100 independent experiments. Observe that in this low sample regime, when LASSO or correlation 
learning are used instead of PCS in the first stage, the performance suffers. 

c) Estimation of FWER Using Monte Carlo Simulation.: We set p = 1000, k = 10 and n = (100 : 
100 : 1000) and using Monte Carlo simulation, we computed the probability of error (i.e. when the exact 
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1.3 




IsO 140 150 160 170 180 190 200 
n = 25 log (t) 



Fig. 8. Prediction MSB for tiie two-stage predictor when n = 25 log t samples are used for screening at the first stage and all 
t samples are used for computing the OLS estimator coefficients at the second stage. The solid plot shows the MSB when PCS 
is used in the first stage while the red and green dashed plots show the MSB when PCS is replaced with LASSO and correlation 
learning, respectively. Here, p = 10* and q = 1. The Oracle OLS (not shown), which is the OLS predictor constructed on the 
true support set, has average MSB performance that is a factor of 2 lower than the curves shown in the figure. This is due to 
the relatively small sample size available to these algorithms. 

support is not recovered) for the PCS. In order to prevent the ratios \aj\/ X^isTro I'^H; J ^ from getting 
close to zero, the active coefficients were generated via a Bernoulli-Gaussian distribution of the form: 

a ~ 0.5AA(1, cT^) + 0.5AA(-1, cT^), (29) 

Figure 9 shows the estimated probabilities. Each point of the plot is an average of = 10'^ experiments. 
As the value of a decreases dynamic range coefficient (26) goes to infinity with high probability and the 
probability of selection error degrades. As we can see, the FWER decreases at least exponentially with 
the number of samples. This behavior is consistent with Prop. 3. 

d) Application to Experimental Data.: We illustrate the application of the proposed two-stage 
predictor on the Predictive Health and Disease dataset, which consists of gene expression levels and 
symptom scores of 38 different subjects. The data was collected during a challenge study for which 
some subjects become symptomatically ill with the H3N2 flu virus (Huang et al., 2011). For each subject, 
the gene expression levels and the symptoms have been recorded at a large number of time points that 
include pre-inoculation and post-inoculation sample times. 10 different symptom scores were measured. 
Each symptom score takes an integer value from to 4, which measures the severity of that symptom 
at the corresponding time. The goal here is to learn a predictor that can accurately predict the symptom 
scores of a subject based on his measured gene expression levels. 
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n 



Fig. 9. Probability of selection error as a function of number of samples for PCS. The entries of the coefficient matrix are 
i.i.d. draws from distribution (29). 

The number of predictor variables (genes) selected in the first stage is restricted to 50. Since, the 
symptom scores take integer values, the second stage uses multinomial logistic regression instead of the 
OLS predictor. The performance is evaluated by leave-one-out cross validation. To do this, the data from 
all except one subject are used as training samples and the data from the remaining subject are used 
as the test samples. The final MSE is then computed as the average over the 38 different leave-one-out 
cross validation trials. In each of the experiments 18 out of the 37 subjects of the training set, are used 
in first stage and all of the 37 subjects are used in the second stage. It is notable that except for the first 
two symptoms, PCS performs better in predicting the symptom scores. 

Note that, in this experiment, each symptom is considered as a one dimensional response and the 
two-stage algorithm is applied to each symptom separately. 

VIII. Conclusion 

We proposed an algorithm called Predictive Correlation Screening (PCS) for approximating the p-values 
of candidate predictor variables in high dimensional linear regression under a sparse null hypothesis. 
Variable selection was then performed based on the approximated p-values. PCS is specifically useful 
in cases where n <^ p and the high cost of assaying all regressor variables justifies a two-stage design: 
high throughput variable selection followed by predictor construction using fewer selected variables. 
Asymptotic analysis and experiments showed advantages of PCS as compared to LASSO and correlation 
learning. 
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Symptom 


MSB: LASSO 


MSB: PCS 


Runny Nose 


0.3346 


0.3537 


Stuffy Nose 


0.5145 


0.5812 


Sneezing 


0.4946 


0.3662 


Sore Throat 


0.3602 


0.3026 


Earache 


0.0890 


0.0761 


Malaise 


0.4840 


0.3977 


Cough 


0.2793 


0.2150 


Shortness of Breath 


0.1630 


0.1074 


Headache 


0.3966 


0.3299 


Myalgia 


0.3663 


0.3060 


Average for all symptoms 


0.3482 


0.3036 



TABLB I 

MSB OF THE TWO-STAGE LASSO PREDICTOR AND THE PROPOSED TWO-STAGE PCS PREDICTOR USED FOR SYMPTOM 
SCORE PREDICTION. THE DATA COME FROM A CHALLENGE STUDY EXPERIMENT THAT COLLECTED GENE EXPRESSION AND 

SYMPTOM DATA FROM HUMAN SUBJECTS (HUANG ET AL., 201 1). 
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IX. Appendix 

Proof of Prop. 1: 

Define ip^ = I{df > 6), where df is the degree of vertex i in part x in the thresholded correlation 
graph. We have: N^^^ = Yl^=i • Define 4>^j = ^ ^('"j Uf )), where A{r, Uf ) is the union of two 

anti-polar caps in Sn-2 of radius a/2(1 — p) centered at Uf and — U?. (/>f can be expressed as: 

^r = E E fl<^ n (i-CD' (30) 

'="5 keC{q,l) J=l m=«+l 

where = {ki, kq) and (7(9, 1) = {k : ki < k2 < ■■■ < ki, ki+i < ... < kq, kj G {1, 2, q}, ki 7^ kj}. 
By subtracting Y.kec{q,i) 11^=1 both sides, we get: 

keC{q,l) J=i 

E E n + 

E E (-ir-^nc! 

fe6C'(q,(5) "»='5+l 3=1 
m 

E n (31) 

fc5+i<-<fe;.,{fej+i, ■■•,fe;»}c{fe5+i, .••,*;<,} "=<5+i 
The following inequaUty will be helpful: 

k 

^[11'^^!]=/ dv [ dui... [ du, 

JSn-2 JAir,v) JA{r,v) 

fuf^,...,ur^,ut{ui, ■■■,Uk,v) (32) 
< Po'a^M^^,, (33) 

where M^^^ =maXi^^„.^i^^i\\fuy^^_^^,ufJur\\oo- 
Also we have: 

^nCJ^^o'-CMi^Qr (34) 

1=1 

where Q = unique{{ii,ji}) is the set of unique indices among the distinct pairs and M^q^ 

is a bound on the joint density of Uq'. 
Now define: 

E n«,- (35) 
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Now, we show that 

\m] - i^^mw < i.aqPoY^'^ (36) 

where 7^ ,5 = 2emax.s+i<i<q{anM^^^}. To show this, take expectations from both sides of equation (31) 
and apply the bound (33) to obtain: 

l=S+l ^ ^ 



< maxs+i<i<g{a^^M^^^} 
i=s+i ^ ^ ^ ^ /=i ^ ^ 

1=1 

< maxs+i<i<g{a'^Mfp2e{qPoY+\ (37) 
in which, the third inequahty follows from the assumption qPo < 1 along with the inequality : 

k=s+l ^ ^ fc=s+l 
A;=0 

Application of the mean value theorem to the integral representation (32) yields: 

\mi\ - ^o'-/(/u^„...,ux„Uf)l < ll%^P^)'r. (39) 
where 7^^ = 2a^^^ M^^^^^^j b\ and is a bound on the norm of the gradient: 

Vur,,...,ur/m„...,u^,iuf(ur^,-,ufjuf). (40) 

Combining (37) and (39) and using the relation r = 0((1 — p)^^^) we conclude: 

l^[</'f] - Q^'o^(/u?,u^„...,u«,)l < 

0(/(gPo)'max{pPo, (1 - p)'/'})- (41) 
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Summing up over i we conclude: 



0{p{pPoYmax{pPo, (1 - p)^/^) (42) 

= o((<l5) WO-^ (1 - P)'^'}), 

where rj^^ ^ = p^/'^gPo. This concludes (16). 

To prove the second part of the theorem, we use Chen-Stein method (Arratia et al., 1990). Define: 

E nci- (43) 

0<io<p,0<ii<...<is<q j=l 

Assume the vertices i in part x and y of the thresholded graph are shown by and respectively, for 
t = define the index set = = {(jo, Ji--' ) ■ Ji ^ ^k^i^i) ^ifjf ^ 

U if , / = 1, 6} n CZ' where C^^ = {(jo, -Js) ■ 1 < jo < pA < ji < ■■■ < js < q}- Note 
that \B^y\< k^+'^. We have: 

Assume -A^^^^ is a Poisson random variable with E[Ng^J^] = N^^. Using theorem 1 of (Arratia et al., 
1990), we have: 



where: 



2 maxA \piNsl ^ ^) - 7 A)\ < h + b2 + h, (45) 



5 (5 

= E E ^[ncincj' (47) 

andfor^^.,=E[nLiCJ: 

^3 = E ^[^tn C - ^'r- l-^l ^ ^ ^f]] • (48) 

Using the bound (34), £;[nf=i ^^jj is of order O(Po^). Therefore: 

hi < 0{pq^k^+^P^^) = 
= 0{{vZ,5fik/{p^^q^Af^^). (49) 
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Note that, since j, nf=i 4>^oii \]^m=i ^^ojm ^ multiplication of at least (5 + 1 different characteristic 
functions. Hence by (34), 

s s 

EiIl^Z,Il<I^ZJ = OiPn- (50) 

1=1 m=l 

Hence, 62 < 0{pq^k^+^P^-^^) = 0{{rfJ^^^^^y+^{k/{p\qY/^^+^'^Y+^). Finally, to bound 63 we have: 

^3 = E ^[^in - Pr^^ lu^ri^]] = (51) 

= E L.,r„^ (n/ -i^s/ 

^ /u;''(Uf) ^ 

/u-(Up)/u.»„,,(u^..(^) (52) 



Therefore: 



|p(iV,^^>0)-(l-exp(-A7))|< 

\p(N^y>o)-(N^y>o)\ + 
\p{N^y>o)-ii-cxp{-E[N^y]))\ + 

|exp(-^[iV,^^])-exp(-Af)| 

< + 61 + 62 + 63 + 0{\E[N^y] - Af I). (53) 

Hence, it remains to bound 0(|£'[iV|^^] — A^^|). Application of mean value theorem to the multiple 
integral (32) gives: 

\E[Yl ^1:] - ^o'^(/ur,,...,ur,,Uf )l < 0{Pir). (54) 

1=1 

Using relation (44) we conclude: 

Oipq'P^r) = 0{{vZ,s)'r). (55) 

Combining this with inequality (53) along with the bounds on bi, 62 and 63, completes the proof of (17). 

□ 
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Proof of Prop. 2: 

We prove the more general proposition below. Prop. 2 is then a direct consequence. 

Proposition: LetX and Y be nxp and nxq data matrices whose rows are i. i. d. realizations of elliptically 
distributed p-dimensional and q-dimensional vectors X and Y with mean parameters fix <^nd fiy and 
covariance parameters cmd Sy, respectively and cross covariance S^:y. Let = [Uf , . . . , U^] and 

= [U]', . . . , Uq] be the matrices of correlation U- scores. Assume that the covariance matrices 
and Tiy are block-sparse of degrees dx and dy, respectively (i.e. by rearranging their rows and columns, 
all non-diagonal entries are zero except a dx x dx or a dy y. dy block). Assume also that the cross 
covariance matrix is block-sparse of degree di for x and degree d2 for y (i.e. by rearranging its 
rows and columns, all entries are zero except a di x d2 block), then 

W = W{l + 0{dx/p)). (56) 

Also assume that for 5 >1 the joint density of any distinct set of V -scores Uf , U^^ , • • • , U^^ is bounded 
and differentiable over Then the [5 + l)-fold average function J(/u»',ux ,...,UX ) cind the average 

dependency coefficient W^^^kiW ^^^isfy 

J{fv^.,ut„...,vO = 1 + 0(max{|, <5^^Y^}), (57) 
IIM„,mIIi = 0- (58) 



Furthermore, 



-dx di {dy - 1)^ 



Jifij,,vt,,...,uO = l + 0(max{^,^,<^^^}) (59) 
\\^Z,n,k,5h = 0{{dx/p)). (60) 



Proof: We have: 

fj- = (U-(U-)^)-iu-D-^t),(^,(^,).)_,^,. (61) 
By block sparsity of S^jU^ can be partitioned as: 



U^ = [U^,U''], (62) 



where = [Uf , • • • , ] and = [U^, • • • , Up_^ J are dependent and independent columns of U^, 
respectively. Similarly, by block sparsity of Hy, 

uy = [iiy,W], (63) 
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where = [JJ^, ■ ■ ■ ] and = [U^, • • • ,U^_(i^] are dependent and independent columns of 
U^', respectively. By block sparsity of ^xy, at most di variables among U^, • • • ,Up_^ , are correlated 
with columns of U^. Assume the correlated variables are among U^, • • • jV^-y Similarly, at most d2 
variables among U^, • • • , U^^^^ are correlated with columns of U^. Without loss of generality, assume 

— y — y 

the correlated variables are among U^^ , • • • , U^^ . 
The columns of U"^, are i.i.d. and uniform over the unit sphere S'n-2- Therefore, as p ^ oo: 

1 fT-(rf ^ £;[Ui(Uif ] = (64) 



p — dx n — 1 

Also, since the entries of l/cJ^n^lM^)^ are bounded by one, we have 

1 



iriirf = o{dx/p), (65) 
p 



where 0{u) is an (ra — 1) x (n — 1) matrix whose entries are 0{u). Hence: 

1 



^ ■-iIn-l + 0{dx/p))-'U' 

n — 1 

= —U-{l + Oidx/p)). (66) 



Hence, as p ^ oo: 



Thus: 



= (!^)2(U-)^U-(1 + 0{dx/p)). (67) 

D(u»)^(u»(u-)^)-2u» = + 0{dx/p))^ . (68) 

Combining (68) and (66) concludes (56). 

Now we prove relations (57) and (58). Define the partition C = V [JV^ oi the index set C defined 
in (12), where V = {i = (io,«i, • • • ,^5) : io is among p — di columns of that are uncorrelated of 
columns of and at most one of n, • • • ,^5 is less than or equal to dy] is the set of {6 + l)-tuples 
restricted to columns of and that are independent. We have: 

J(/us,u^,,...,u^,) = \C\-'2-' E 

(E+E)'^(/^oUr„.^iur,,...,..u«p, (69) 
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and 

But, J'(/soU»^^,siUV^,...,s5U«_j) = 1 for P and ^p^g^n,k,si^) = for ?G C. Moreover, we have: 

M = .^. (p-^ocg- dy + ly 

\C\ ^ pq 



o r ^'Ls ' )■ (71) 



Thus: 



-di (dy ~ 1) 



^(/Ui,u«,,...,u«j = l + 0(max{^,<5^^^}). (72) 
Moreover, since = lJ^{l + 0{dx/p)), f-(jx us/ jjy = /u? .u?* ,...,uf {^ + 0{dx/p)). This concludes: 

u" U" ) = l + 0(max{^,^,5^^^}), (73) 

and 

l|A;i„,Mlli = 0(d./p). (74) 

□ 

Proof of Proposition 3: First we prove the theorem for = 1. Without loss of generality assume 

Y = aiXi + a2X2 + ■■■ + auXk + aN, (75) 

where N is follows the standard normal distribution. Note that since g = 1, ai, • • • , afe are scalars. 
Defining b = S^/^a, the response Y can be written as: 

Y = aiZi + 02^2 + • • • + akZk + aN, (76) 

in which Zi, - ■ ■ , are i.i.d. standard normal random variables. Assume Ui, • • • , Up, Un represent the 
U-scores (which are in Sn-2) corresponding to Zi, - ■ ■ ,Zp,N, respectively. It is easy to see: 

^ 61U1 + 62 U2 + • • • + ftfcUfc + (TUn 

" ||6iUi+62U2 + --- + 6fcUfc + cTUjvir 

If U and V are the U-scores corresponding to two random variables, and r is the correlation coefficient 

between the two random variables, we have: 

, , 1 _ (min{||U-V||,||U + V||})2 
II 2 
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Let Vy^i represent the sample correlation between Y and Xj. Here, we want to upper bound prob{|rj;,i| < 
We have: 

prob{|rj/,i| < \ry^k+i\} = 

probU + < 

^ (min{||Ufc+i-U,||,||Ufe+i+U,||})^ ^ ^ ^^^^ 

prob{min{||Ui-U^||,||Ui + Uj,||} > 

min{||Ufe+i-Uj,||,||Ufc+i + Uj,||}} < (80) 

prob{||Ui-UJ| > 

min{||Ufc+i-Uj,||,||Ufc+i+Uj,||}} = 

prob{{||Ui -Vy\\> \\Vk+i -Vy\\} U 

{||Ui-U,||>||Ufe+i+U,||}} < 

prob{||Ui-U2,|| > ||Ufc+i-U2,||} + 

prob{||Ui-Uj,|| > ||Ujfc+i + Uj,||} = (81) 
2 prob{||Ui - VyW > ||Ufc+i - Uj,||}, (82) 

in which, the last inequality holds since U^+i is uniform over Sn-2 and is independent of Ui and U^. 
Therefore, it suffices to upper bound pi := prob{||Ui — \Jy\\ > ||Ufc+i — Uy||}. Define: 

V = 52U2 + • • • + bkUk, (83) 

and 

U*=V/||V||. (84) 



By symmetry, U* is uniform over Sn-2- Hence: 

feiUi + ||V||U 
|6iUi + ||V||U: 

Since ||V|| < I62I + • • • + l^fel^ we have: 



U^; = : "JLrU - (85) 



||V|| - \b2\ + ■ ■ ■ + \bk\ ci 
where ci := I62I + • • • + \bk\. Define: 



> „ , „ , =^, (86) 



i = cos-i(U^Ui), (87) 
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and 



It is easy to see that: 



?i = cos-i(U^U*). (88) 



sm Ui ci 

< — ^. (89) 



sin 02 \bi 
For eacli < 9 < tt, define: 

Pi{9) = max s.t. < t^. (90) 

Now fix the point Ui on S'n-2- Define f{9) as the probability distribution of 02- Also, define p{9) as 
the probability that the angle between the uniformly distributed (over S'n-2) point U^+i and is less 
than 9. Since Ui is independent of U* and Vk+i is independent of U^^, clearly: 

Pi.0) = / f{e')d9'. (91) 



We have: 



PI < r pm9)9)f{9)d9 
Jo 

{p{l3i{9)9)+pm7r - 9){7r - 9))) f{9)d9, (92) 

Jo 



where the last equality holds because f{6) = /(tt — 9). Noting the fact that: 

r-7r/2 



r p{9)f{9)d9= r ip{9)+ 

Jo Jo 
p{tt - 9))f{9)d9 = f{9)d9 = ^. (93) 



we conclude: 

r7r/2 



- ^) -p(/3i(7r - 9){7r - 9)))}f{9)d9. (94) 



1 /■'r/2 



Hence by (91), for any < 6'o < tt/2: 

Pi<o - I P^MfiO)d9, (95) 
in which 

p^M = p{9 + j,e)-p{9-j,9) 

= prob{0 -ji9<92<9 + 71^}, (96) 
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with 



71= mill l-^i(e) = l- max MB). (97) 

e„<e<TT-e„ e„<e<TT-e„ 



It is easy to check that 71 > 0. Therefore, since P'y^{9) is an increasing functions of for < ^ < 7r/2, 
we conclude: 

1 



pi<2- I PiMfme. (98) 



Choose ^0 so that = 2+^- ^® have: 



Pi< J-P7.(^/(2 + 7i)) r''' md9 
^ J-k/ 



t/(2+7i) 



1 



7r{l+7i)/(2+7i) /•'r/2 



f{9)d9 r f{9)d9 

Jn/(2+'yi) 



2 y7r(l-7i)/(2+7i) A/(2+7i) 
1 /■7r/2+7i7r/6 f^f^ 

<i^- f{G)d9 / f{9)d9 

^ J7r/2-7i7r/6 >/7r/2-7i7r/6 

<\-2[r'^ f{9)d9\ , (99) 

^ yJ7r/2-7i7r/6 / 

in wliich, the last inequality holds, since < 71 < 1. Defining Ai = sin(7r/2 — 7i7r/6) and using the 
formula for the area of the spherical cap, we will have: 

rTr/2 



/ fi9)de 

^7r/2— 7i7r/6 



7i((n - 2)/2, 1/2) - ((n - 2)/2, 1/2) 
2Ii((n-2)/2,l/2) 

in which 



(100) 



Ua,b) = f, J-—, (101) 



is the regularized incomplete beta function. Hence: 



Note that we have: 



-/2 fl t^^~^)/^/^/Y~tdt 

f(e)de = ' — ■ (102) 

,r/2-7iV6 2 /(J t(^-*)/yVT^tdt 

It t^''-^^/VVT^tdt 



> 



2/o^^t("-4)/2/^/r^dt 
2Jit(^-^yyVT^dt 

^ _ y(n-2)/2 

= ,(n-2)/2 - (103) 
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Hence: 



/ f(o)de > — 



' - —h . (104) 



2(ki + 1) 



Hence by (99): 



< Ap^^/^ - xr' < \r^'. (105) 

Therefore, pi decreases at least exponentially by n. 

Assume P{i) for 1 < i < A;, represents the probability that the active variable Xi is not among the 
selected k variables. By (82) and using the union bound we have: 

P(l) < 2{p - k)\^^~^^''^. (106) 

Similar inequalities can be obtained for P(2),-- - ,P{k) which depend on A2,--- , A^, respectively. 
Finally, using the union bound, the probability P that all the active variables are correctly selected 
satisfies: 

k 

P>l-2{p-k)J2 "^^/^ > 1 - 2k{p - A;)A^"-2)/^ (107) 

1=1 

where A := maxi<j<fc Aj. This concludes that if n = 0(logp), with probability at least 1 — 1/p the exact 
support can be recovered using PCS. 

For g > 1, by union bound, the probability of error becomes at most q times larger and this concludes 
the statement of proposition 3. □ 
Proof of Proposition 4: First we consider a two-stage predictor similar to the one introduced in previous 
section with the difference that the n samples which are used in stage 1 are not used in stage 2. Therefore, 
there are n and t — n samples used in the first and the second stages, respectively. Following the notation 
introduced in previous section, we represent this two- stage predictor hy n\{t — n). The asymptotic results 
for the n\{t — n) two-stage predictor will be shown to hold as well for the n\t two-stage predictor. 

Using inequalities of the form (106) and the union bound, it is straightforward to see that for any 
subset TT 7^ ttq of /c elements of {1, • • • the probability that vr is the outcome of variable selection via 
PCS, is bounded above by 2k{p — k)c^, in which < < 1 is a constant that depends on the quantity 

min ^^"'^\ , . (108) 

The expected MSB of the n\{t — n) algorithm can be written as: 

E[MSE]= PWE[MSE^]+p(7ro)E[MSE^J, (109) 

7re5^,7r^7ro 
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where is the set of all A;-subsets of {1, ■ • • ,p}, p(7r) is the probability that the outcome of variable 
selection via PCS is the subset tt, and MSE^r is the MSE of OLS stage when the indices of the selected 
variables are the elements of tt. Therefore using the bound (107), the expected MSE is upper bounded 
as below: 

E[MSE] < 2k{p -k) c;fE[MSE^] + 

(l-2A;(p-fc)c[J)E[MSE^], (110) 

Co is a constant that depends on the quantity (26). It can be shown that if there is at least one wrong 
variable selected (tt 7^ ttq), the OLS estimator is biased and the expected MSE converges to a positive 
constant M^^ as (t — n) — ^ 00. When all the variables are selected correctly (subset ttq), MSE goes to 
zero with rate 0{l/{t — n)). Hence: 

E[MSE] < 2k{p -k) c"^^ + 

(1 - 2k{p - k)cS)0{l/{t - n)) < 
2k{p - A;)CiC" + (1 - 2k{p - A;)C")C2/(t - n), (111) 

where C, Ci and C2 are constants that do not depend on ra or p but depend on the quantities X^jeTro ^] 
and minj6^„ |aj|/Eie,ro 

On the other hand since at most t variables could be used in OLS stage, the expected MSE is lower 
bounded: 

E[MSE] > e(l/t). (112) 

It can be seen that the minimum of (111) as a function of n, subject to the constraint (23), happens 
for n = 0{logt) if e(logt) < otherwise it happens for 0. If e(logt) < the minimum value 

attained by the upper bound (111) is @{l/t) which is as low as the lower bound (112). This shows that 
for large t, the optimal number of samples that should be assigned to the PCS stage of the n\{t — n) 
predictor is n = O(logt). As t ^ 00, since n = O(logt), the MSE of the n\t predictor proposed in Sec. 
V converges to the MSE of the n\ {t — n) predictor. Therefore, as t — )• 00, n = 0(Iog t) becomes optimal 
for the n|t predictor as well. □ 
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